Abstract

With the advent of high-throughput wet lab technologies, the amount of publicly available protein
interaction data has increased substantially, in turn spurring a plethora of computational methods
for in silico knowledge discovery from this data. In this paper, we focus on parameterized methods
for modeling and solving complex computational problems encountered in such knowledge discovery
from protein data. Specifically, we concentrate on three problems relevant to proteomics today,
namely the detection of lethal proteins, functional modules and alignments from protein interaction
networks. We propose novel graph theoretic models for these problems and devise practical
parameterized algorithms. At a broader level, we demonstrate how these methods can be viable
alternatives to the many heuristic, randomized, approximation and sub-optimal methods by arriving
at parameterized yet optimal solutions for these problems. We substantiate these theoretical results
by experimenting on real protein interaction data of S. cerevisiae (budding yeast) and verifying
the results using gene ontology.


1 Introduction 

With the advent of many high-throughput wet lab technologies, the amount of biological data
available for analysis has increased substantially. For instance, Mass Spectrometry and Tandem
Affinity Purification (MS-TAP), Yeast Two-Hybrid (Y2H) and ChIP-on-Chip experiments have
contributed to large publicly accessible biological databases like BIND, Biogrid and MIPS [1].
This in turn has increased the need to make sense of such large quantities of data and to
extract useful and intelligent information from them. Examples include the discovery of
disease-causing genetic information from the population in a certain region, protein structural
information from protein databases, and phylogenetic distances from genomic data. Such
information has applications in areas like drug discovery, population genetics and phylogenetics.

However, in most cases, extraction of such intelligent and useful information is not easy.
Many of the problems encountered are NP-hard, with researchers working on sub-optimal,
heuristic-based, yet fast solutions for these problems. Such in silico knowledge discovery
from biological data is a very complicated area of research. The techniques adopted span
several areas of computing such as algorithmic theory, statistics and machine learning. However,
no particular technique or set of algorithms can claim to solve a problem completely. Even if a
particular technique solves a problem efficiently and exactly, the solution may turn out not to be
useful for biologists. Biologists do not always look for exact or fast solutions or complicated
methods. Many times the ‘biological relevance’ of the solution is very important. Hence, searching
for solutions in computational biology is usually a subtle balance between specificity (measured
in terms of false positives and false negatives), efficiency (speed) and biological relevance. This
is one of the reasons why many techniques developed these days are ensemble methods that mix
and match several methods, each strong in one sense.

In this paper, we intend to show the recent area of Parameterized Complexity as an alternative
to existing techniques used to solve hard problems in computational biology. In order
to show this, we take up several relevant problems motivated by computational proteomics
and model them as interesting theoretical problems, for which we then propose parameterized
algorithms. In this manner, on one hand, we deal with interesting theoretical results which
have relevance to computer science and, on the other, with concrete applications to computational
biology. We do not claim that our techniques are the most efficient or give the most biologically
relevant solutions, but they certainly are worthy alternatives to several existing techniques, and
hence can also be adopted by ensemble methods.

Problems addressed and related work: In this paper, we specifically focus on knowledge
extraction from protein interaction data. We try to solve two problems:

(1) extraction of hubs and functional modules from protein interaction networks, and

(2) alignment of protein interaction networks across species.

Some previous works in this area have focused on heuristic methods [?], linear programming
and matrix methods [13] and clustering [?]. Kumar [?] gives a detailed survey of data mining
based methods. Some applications of parameterized algorithms in computational biology can
be found in the works of Böcker, Niedermeier [10] and Langston [?], and in some of our
preliminary work on hubs and quasi cliques in [18]. Having given the references, we now begin
our work with a brief background of the problems in focus:

Protein networks: Proteins are organic compounds that form vital constituents of living
organisms. They are responsible for several biological processes and pathways. It has been
observed that proteins interact with one another while performing functions. These interactions
are captured in the form of Protein-Protein Interaction (PPI) networks that are now publicly
available in databases such as BIND, MIPS and Biogrid [1]. PPI networks demonstrate the
so-called scale-free properties.

Scale-free properties of protein networks: Though not all biological networks demonstrate
scale-free properties, it is believed that protein (PPI) networks display most of these [?].
Scale-free networks are those that follow a power-law distribution in connectivities, that is, the
probability that a node links to $k$ other nodes is $P(k) \sim k^{-\alpha}$, where $\alpha$ is a small
constant [?]. This leads to a skewed distribution: there exists a small set of nodes with very many
interactions while the others have far fewer interactions. Such a phenomenon gives rise to two
interesting structures, namely ‘hubs’ and ‘communities’. Hubs in protein networks represent lethal
or important proteins that interact with most other proteins and hence hold the network intact.
Most lethal proteins are strategically located and their disruption could possibly lead to biological
lethality. Hence, their study is important to understand the causes of diseases. For example, the
deletion of the myosin pair myo3-myo5 causes severe defects in growth and cytoskeleton
organization [2]. Communities refer to functional modules. Proteins within a community form a
densely connected region within the network and perform specific biological functions. Examples
are the functional categories of proteins involved in stress response and in the biosynthesis of
vitamins and prosthetic groups [?]. Detecting communities is therefore relevant to understanding
protein affinity in biological processes.
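
The skewed degree distribution described above can be inspected directly on an interaction list.
The following is a minimal sketch only; the function name `degree_distribution` and the toy edge
list are assumptions made for illustration and are not part of the original study.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: fraction of nodes with degree k} for an undirected edge list.

    Nodes that never appear in an edge (isolated proteins) are not counted.
    """
    neighbours = {}
    for u, v in edges:
        if u == v:                      # ignore self-interactions
            continue
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    counts = Counter(len(nbrs) for nbrs in neighbours.values())
    total = len(neighbours)
    return {k: c / total for k, c in sorted(counts.items())}

# Toy network: one hub ("h") and four low-degree nodes.
toy_edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "b")]
dist = degree_distribution(toy_edges)
```

On real data one would typically plot `dist` on log-log axes and look for an approximately
straight line before treating the network as scale-free.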

Protein network alignment: Researchers usually derive phylogenetic or evolutionary distances
between species by comparing their genome sequences, for example by calculating edit distances
and reversals. However, some recent works [?] have shown that alignment of protein networks can
also lead to useful clues in constructing phylogenetic trees. Alignment here refers to deriving
local isomorphisms between protein networks of two or more species to understand the similarity
traits between them, analogous to sequence alignments. Protein network alignments can
be useful to derive network motifs (frequent local network patterns) across protein networks of
species [?].

2 Preliminaries 

Parameterized Complexity can be considered as a two-dimensional generalization of classical
complexity theory, with one dimension being the size of the input instance, $|I| = n$, and the
second being a fixed input parameter $k$ [19]. Parameterized complexity studies the hardness of
problems based on these two dimensions and gives a finer classification of problems into
$FPT \subseteq W[1] \subseteq W[2] \subseteq \cdots$. Among these classes is the class FPT of
Fixed-Parameter Tractable problems, which are interesting because they may be practically solvable.
Formally, a problem is in the class FPT if it can be solved in time $O(f(k)\, n^{O(1)})$, where $f$ is
a function of the fixed parameter $k$ alone [?]. If under certain practical conditions $k$ turns out
to be ‘small’ compared to $n$, then these solutions turn out to be more tractable than the trivial
$O(n^k)$ solutions for such problems. Even more importantly, FPT algorithms arrive at optimal
solutions (subject to parameterization) and hence form interesting alternatives to approximate,
randomized or heuristic approaches. It is therefore worthwhile exploring parameterized alternatives
for problems in computational biology, because results missed due to approximation may well be
biologically relevant cases.

Throughout this paper, we model protein networks as simple undirected graphs, with proteins
as nodes and their interactions as edges. Given a graph $G = (V, E)$, $n$ represents the number of
vertices and $m$ the number of edges. For a subset $V' \subseteq V$, by $G[V']$ we mean the
subgraph of $G$ induced on $V'$. By $N(v)$ we represent all vertices (excluding $v$) that are adjacent
to $v$, and by $N[v]$ we refer to $N(v) \cup \{v\}$. The degree of $v$ is $d(v) = |N(v)|$.
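
The notation above maps naturally onto an adjacency-set representation. The sketch below is one
possible encoding, given only for illustration; all function names are hypothetical and not taken
from the authors' implementation.

```python
def build_graph(vertices, edges):
    """Adjacency-set model of a simple undirected graph G = (V, E)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u != v:                          # simple graph: drop self-loops
            adj.setdefault(u, set()).add(v)  # sets merge repeated edges
            adj.setdefault(v, set()).add(u)
    return adj

def open_neighbourhood(adj, v):             # N(v)
    return set(adj[v])

def closed_neighbourhood(adj, v):           # N[v] = N(v) U {v}
    return adj[v] | {v}

def degree(adj, v):                         # d(v) = |N(v)|
    return len(adj[v])

def induced_subgraph(adj, subset):          # G[V']
    keep = set(subset)
    return {v: adj[v] & keep for v in keep}

# Toy input with a repeated edge and a self-loop, both cleaned away.
G = build_graph("abcd", [("a", "b"), ("b", "c"), ("b", "c"), ("c", "c")])
```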

3 Modeling hubs and functional modules in protein networks 

We first deal with the problem of extracting hubs and functional modules. Most previous
approaches have looked at the problem of detecting functional modules separately from that of
hub proteins. However, hub proteins form the key constituents of most functional modules. In
any functional module, the hub proteins not only hold the other proteins within that module
together, but also determine the most dominant functions of that module. Also, protein networks
are usually very sparse with several singleton (isolated) and loosely-connected proteins. Such
proteins may be connected to one or more hub or non-hub proteins, but generally do not take
part in any functional module. Taking such proteins into consideration induces false positives,
thereby affecting the specificity of the results. As a result of these observations, we approach
the whole problem in the following three steps:

Step 1 : Determine all the hub proteins within the network.

Step 2 : Determine all the proteins (hubs and non-hubs) which are more likely to take part
in some functional module(s) and filter away the rest.

Step 3 : Combine the information from the above steps to determine the individual functional
modules.

We now describe these steps in the following paragraphs: 

Theoretical modeling: The main theoretical counterpart of our practical problem in hand
is what we call the list dominating set problem:

LIST DOMINATING SET: For a given graph $G = (V, E)$, an integral list $L = \{l(v_1), \ldots, l(v_n)\}$,
$0 \le l(v_i) \le d(v_i)$, $1 \le i \le n$, and a positive integer $k$, a subset $D \subseteq V$ is called
$L$-dominating if for every $v_i \in V$, either $v_i \in D$ or $|N(v_i) \cap D| \ge l(v_i)$. The problem asks
whether we can find an $L$-dominating set $D$ such that $|D| \le k$.
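
To make the definition concrete, a candidate set can be verified mechanically. The checker below is
a sketch under the definition just stated; the name `is_list_dominating` and the toy triangle are
assumptions for illustration. The two demand lists show the special cases $l(v_i) = d(v_i)$
(vertex cover) and $l(v_i) = 1$ (dominating set).

```python
def is_list_dominating(adj, demand, D, k):
    """True if |D| <= k and every vertex outside D has at least
    demand[v] neighbours inside D (the L-dominating condition)."""
    D = set(D)
    if len(D) > k:
        return False
    return all(v in D or len(adj[v] & D) >= demand[v] for v in adj)

# Toy graph: a triangle on a, b, c.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
as_vertex_cover = {v: len(triangle[v]) for v in triangle}   # l(v) = d(v)
as_dominating_set = {v: 1 for v in triangle}                # l(v) = 1
```

With `as_vertex_cover`, two vertices of the triangle are needed; with `as_dominating_set`, a single
vertex suffices.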

This was first introduced in [?] under the name of ‘vector dominating set’. The ‘vector’
$L$ gives the number of neighbors any vertex $v \in V \setminus D$ needs to have in $D$. This is an
interesting problem which generalizes the classical vertex cover problem when $l(v_i) = d(v_i)$,
and the dominating set problem when $l(v_i) = 1$. It is NP-complete in the general complexity
domain, but in the parameterized domain it scales a whole ‘tower’ of complexities from
FPT (vertex cover) to W[2]-hard (dominating set), and the ‘jump’ has been studied by us in [?].

In this paper, we show how this single problem can serve our dual purpose of modeling
hubs and participating proteins (both steps 1 and 2) very well. However, since in general it is
W[2]-complete [20], we require a suitable reformulation of the problem to arrive at a practical
solution, which we propose as follows:

$(k, \lambda)$-list dominating set: For a given graph $G$, an integral list $S = \{s(v_1), \ldots, s(v_n)\}$,
$0 \le s(v_i) \le d(v_i)$, $1 \le i \le n$, and positive parameters $k$ and $\lambda$, a subset $D \subseteq V$ is called
$(k, \lambda)$-dominating if for every $v_i \in V$, either $v_i \in D$ or $|N(v_i) \cap D| \ge d(v_i) - s(v_i)$. The problem
asks whether we can find a $D$ such that $|D| \le k$ and $\sum_{v_i \in V \setminus D} s(v_i) \le \lambda$.

Here, we refer to $S$ as the ‘slack’ list, given by
$S = \{s(v_1), \ldots, s(v_n)\} = \{d(v_1) - l(v_1), \ldots, d(v_n) - l(v_n)\}$.
For a vertex $v_i$ outside $D$, $d(v_i) - s(v_i)$ gives the minimum number of neighbors of $v_i$ in $D$
and $s(v_i)$ gives the maximum number of neighbors outside $D$, while $\lambda$ is an additional parameter
that bounds the number of edges allowable between the vertices outside $D$. Note that if
$\sum_{v_i \in V \setminus D} s(v_i) \le \lambda$, at most $\lambda/2$ edges are allowable between the vertices outside $D$.
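
The reformulated condition can likewise be checked for a given candidate set. The following sketch
assumes the definition exactly as stated above (size bound $k$, slack-sum bound $\lambda$ over the
vertices outside $D$); the function name and the toy path graph are hypothetical.

```python
def is_k_lambda_dominating(adj, slack, D, k, lam):
    """Check the (k, lambda)-dominating conditions for a candidate set D."""
    D = set(D)
    outside = set(adj) - D
    if len(D) > k:                                   # |D| <= k
        return False
    if sum(slack[v] for v in outside) > lam:         # total slack outside D <= lambda
        return False
    # every v outside D needs at least d(v) - s(v) neighbours inside D
    return all(len(adj[v] & D) >= len(adj[v]) - slack[v] for v in outside)

# Toy graph: the path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
no_slack = {v: 0 for v in path}                      # reduces to vertex cover
some_slack = {"a": 0, "b": 0, "c": 1, "d": 1}        # edge (c, d) may stay uncovered
```

With `no_slack` the set {b} is rejected because edge (c, d) is uncovered; with `some_slack` and
$\lambda = 2$ the same set is accepted.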

Next, we show how this new formulation models the steps stated above very well: 

Step 1 - Modeling hubs: Intuitively, hubs are high-degree nodes with which most of the other
nodes in the network interact. Hubs have been previously modeled as vertex covers [?] or
by simply picking all nodes with degrees above a certain threshold. Modeling hubs by
vertex covers forces all linkages to be covered, while the threshold-based approach may miss some
important hubs that have degrees below the threshold. So, what is required is a suitable
‘tunable’ parameter that ensures that there is some ‘relaxation’ in the modeling.

Therefore, we model hubs by $(k, \lambda)$-lds instances for which the $s(v_i)$ values are ‘small’. This
results in the set $D$ becoming the hub-set for the graph $G$, covering all but $s(v_i)$ edges incident
on each $v_i \in V \setminus D$. In the special case of $s(v_i) = 0$, $D$ becomes a vertex cover of $G$. However,
the advantage $(k, \lambda)$-lds offers is flexibility by means of the ‘tunability’ of $S$. For any vertex, we
can control the number of neighbors inside and outside $D$ by tuning $S$.

Such tunability models real-world networks like PPI networks very well. For instance, any non-hub
protein can have a few direct interactions with other non-hub proteins to achieve certain biological
functionalities, but at the same time be densely linked to hub proteins [18].

Step 2 - Modeling participating proteins: Technically, we need to arrive at the main core
that contains the participating proteins (hubs and non-hubs) of all the functional modules
in the network. We model this core as a relatively dense sub-network of the protein network.
We essentially look for a single (possibly huge) such sub-network. Protein networks being usually
very sparse, the density of such a core may not be very high, but by finding one we look
to filter away all the isolated and loosely-connected proteins. A core $C$ is associated with two
parameters: $\gamma$, the edge density, and $|C|$, the size of the core. The density $\gamma$ is defined as the
ratio of the number of edges in $C$ to the total number of edges in a complete subgraph of the
same size.

It is interesting to note that when we consider the complement of a given graph as an instance
of the $(k, \lambda)$-lds problem to find the set $D$, the remaining vertices (in $V \setminus D$) form an
$(n - k)$-size core $C$ of the original graph. Here $s(v_i)$ gives the maximum number of edges missing
from every vertex $v_i$ in the core. Since $\sum_{v_i \in V \setminus D} s(v_i) \le \lambda$, at most $\lambda/2$ edges can
be missing from the core, which gives a lower bound on $\gamma$. Again $S$ acts as a ‘tunable’ vector to
control the sensitivity of the core. Hence, we can use our formulation not only to bound the number
of missing edges but also to specify at which vertices edges may be missing.
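
The complement construction and the resulting density bound can be illustrated on a small example.
This is a sketch only: the helper names, the toy graph, and the choice $D = \{e\}$ with $\lambda = 2$
are assumptions made for the illustration.

```python
def complement(adj):
    """Complement of a simple undirected graph given as adjacency sets."""
    vertices = set(adj)
    return {v: vertices - adj[v] - {v} for v in adj}

def edge_density(adj, core):
    """Edges inside `core` divided by the edges of a complete graph on |core| vertices."""
    core = set(core)
    size = len(core)
    if size < 2:
        return 0.0
    edges = sum(len(adj[v] & core) for v in core) // 2   # each edge counted twice
    return edges / (size * (size - 1) / 2)

def density_lower_bound(core_size, lam):
    """Lower bound on density when at most lam / 2 edges are missing from the core."""
    pairs = core_size * (core_size - 1) / 2
    return (pairs - lam / 2) / pairs if pairs else 0.0

# G: complete graph on {a, b, c, d} minus edge (a, b), plus an isolated protein e.
G = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b", "d"},
     "d": {"a", "b", "c"}, "e": set()}
G_bar = complement(G)     # e becomes adjacent to everything; (a, b) is the only other edge
D = {"e"}                 # a (k, lambda)-lds of G_bar with s(a) = s(b) = 1, lambda = 2
core = set(G) - D
```

Here the core {a, b, c, d} has five of six possible edges in the original graph, which meets the
bound obtained from $\lambda = 2$ with equality.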

Step 3 - Combining the information: The above steps make it easier to determine the
individual functional modules. One natural way to proceed is to consider each hub $h \in D$ and
form the set $FM(h) = \{h\} \cup [N(h) \cap C]$, that is, $FM(h)$ consists of $h$ together with all
vertices in $C$ that are adjacent to hub $h$. We then calculate the edge densities (or average degrees)
of each set and rank them. We consider each set as a functional module and verify it based on
shared gene-protein ontologies.
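
Step 3 can be sketched in a few lines once the hub set and the core are available. The code below
is illustrative only; `rank_functional_modules` and the toy network are hypothetical names and data,
and ranking by edge density is one of the two options mentioned above.

```python
def edge_density(adj, members):
    members = set(members)
    size = len(members)
    if size < 2:
        return 0.0
    edges = sum(len(adj[v] & members) for v in members) // 2
    return edges / (size * (size - 1) / 2)

def rank_functional_modules(adj, hubs, core):
    """Form FM(h) = {h} U (N(h) & C) for every hub h and rank by edge density."""
    core = set(core)
    modules = {h: {h} | (adj[h] & core) for h in hubs}
    return sorted(modules.items(),
                  key=lambda item: edge_density(adj, item[1]),
                  reverse=True)

# Toy network: hub h1 sits in a triangle; hub h2 has two unlinked core neighbours
# and one neighbour x that was filtered out of the core.
adj = {"h1": {"a", "b"}, "a": {"h1", "b"}, "b": {"h1", "a"},
       "h2": {"c", "d", "x"}, "c": {"h2"}, "d": {"h2"}, "x": {"h2"}}
ranked = rank_functional_modules(adj, ["h1", "h2"], {"a", "b", "c", "d"})
```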

4 Modeling of disjoint functional modules 

The modeling described in the previous section has the flexibility of giving overlapping
functional modules, that is, the same proteins participating in more than one module. However,
many times one looks for disjoint or non-overlapping functional modules that are expected to
perform different functions. In this section, we show a model to determine such non-overlapping
modules, and to do so we introduce the following abstract problem:

CLUSTER EDITING: Given an input graph $G = (V, E)$ and positive integers $k$ and $\lambda$, this
problem asks whether we can modify $G$ to consist of disjoint clusters by adding or deleting at
most $k$ edges such that the total number of edges missing across the resultant clusters is at
most $\lambda$.

By clusters we mean partial cliques which may have ‘few’ edges missing compared to cliques
(complete subgraphs). The QCE problem models disjoint modules in a straightforward manner.
After $k$ edge edit operations, if we can obtain a solution, the resultant graph will be a disjoint
union of multiple clusters. In protein networks, each such cluster could represent a disjoint module.
A similar modeling by Niedermeier et al. [?] considers cliques for each disjoint module; however,
cliques are too restrictive and do not cater to naturally occurring communities in networks.

5 Modeling protein network alignments 

We next concentrate on the second problem we mentioned earlier, namely protein network
alignments across species. Modeling of protein network alignments leads us to local graph
alignment, which we call quasi isomorphism. By quasi isomorphism we mean that given two
labeled graphs (labeled by protein annotations), we try to find local regions in the graphs that
are ‘highly’ similar or ‘highly’ isomorphic. This can be considered as a relaxation of the local
graph isomorphism problem where one tries to find an exact one-to-one correspondence between
the labeled nodes and edges in two regions.

In order to model quasi isomorphism, we first state the concept of the product graph of two
graphs. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two given graphs. Let $l$ be any function that
maps the labels of vertices of $G_1$ to $G_2$. We can construct the product graph $H$ of $G_1$ and $G_2$ as
follows:

(a) In graph $G_1$, if there is an edge $e_1 = (u_1, v_1) \in E_1$ between vertices labeled $u_1, v_1 \in V_1$
(or if there is no edge between vertices labeled $u_1, v_1 \in V_1$), and in graph $G_2$ there is a
corresponding edge $e_2 = (u_2, v_2) \in E_2$ between vertices labeled $u_2 = l(u_1), v_2 = l(v_1) \in V_2$ (or,
respectively, there is no edge between vertices labeled $u_2 = l(u_1), v_2 = l(v_1) \in V_2$), then we add
vertices labeled $\{u_1, v_1\}$ and $\{u_2, v_2\}$ to the vertex set of $H$ and add an edge between them.

(b) In all other cases, we do not add any labeled vertices or edges to $H$.
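
The construction in (a) and (b) admits more than one reading. The sketch below assumes one of
them, the modular-product reading: each vertex of $H$ pairs a vertex $u_1$ of $G_1$ with its image
$l(u_1)$ in $G_2$, and two such pairs are joined exactly when adjacency agrees in both graphs (both
edges present, or both absent). The function name and the toy graphs are hypothetical, and this
interpretation is an assumption rather than the authors' stated implementation.

```python
from itertools import combinations

def product_graph(adj1, adj2, label_map):
    """Modular-product style graph H of G1 and G2 under the label mapping l."""
    # Only vertices of G1 whose image exists in G2 take part.
    nodes = [(u, label_map[u]) for u in adj1 if label_map.get(u) in adj2]
    H = {p: set() for p in nodes}
    for (u1, u2), (v1, v2) in combinations(nodes, 2):
        if u2 == v2:                       # mapping not injective on this pair: skip
            continue
        in_g1 = v1 in adj1[u1]
        in_g2 = v2 in adj2[u2]
        if in_g1 == in_g2:                 # adjacency agrees, so no mismatch
            H[(u1, u2)].add((v1, v2))
            H[(v1, v2)].add((u1, u2))
    return H

# Toy example: G1 is a triangle, G2 is the path a - b - c, labels map to themselves.
G1 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
G2 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
H = product_graph(G1, G2, {v: v for v in G1})
```

In this toy case $H$ has three vertices and two edges; the single missing edge reflects the one
mismatch, edge (a, c), which is present in $G_1$ but absent in $G_2$.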

An interesting observation here is that a core $C$ in the product graph $H$ corresponds to local
regions in graphs $G_1$ and $G_2$ that are ‘highly’ similar. That is,

(a) the vertices of these two local regions display labeled mapping: if the region of $G_1$ has
vertex $u_1$, then the region of $G_2$ will have vertex $u_2 = l(u_1)$, and

(b) there is partial correspondence between the edges in these two local regions, that is,
there can be a bounded number of mismatches. By a mismatch we mean that there is an edge
$(u_1, v_1)$ in the local region of $G_1$ but no edge between $u_2 = l(u_1)$ and $v_2 = l(v_1)$ in $G_2$,
or vice versa. These mismatches are bounded by the number of edges missing in the core $C$ of
the product graph $H$.

To apply this to PPI networks, we first build the product graph $H$ based on protein labelings
and then find the core $C$ in $H$. Here, the mapping function $l$ can be the identity function $l(x) = x$
or can depend on the annotation schemes used (for example, the same protein could be labeled
by different names in different species).

6 Parameterized algorithms 

We dedicate this section to proposing parameterized algorithms for all the theoretical problems
introduced so far.

6.1 Solution to $(k, \lambda)$-lds

We begin by giving the following useful lemma:

Lemma 1 If for $v_i \in V$, $d(v_i) > k + \lambda/2$, then $v_i$ is part of every $(k, \lambda)$-dominating set $D$.

Proof: Observe that if $d(v_i) > k + \lambda/2$ and $v_i \in V \setminus D$, then $D$ will be able to accommodate
at most $k$ neighbors of $v_i$, forcing the remaining neighbors to be in $V \setminus D$. This would mean
$\sum_{v_i \in V \setminus D} s(v_i) > \lambda$. Hence, $v_i$ needs to be part of every $(k, \lambda)$-dominating set $D$. □

This leads to the following FPT algorithm for the $(k, \lambda)$-lds problem:

Theorem 1 $(k, \lambda)$-lds is FPT in $k$ and $\lambda$.

Proof: Initially we set $U = V$ and $D = \emptyset$, with parameters $k > 0$ and $\lambda > 0$. We use
Lemma 1 to arrive at data reduction techniques which, in parameterized complexity literature,
are called Kernelization Rules:

If there is a vertex $v_i$ such that $d(v_i) > k + \lambda/2$, then set $D := D \cup \{v_i\}$ and
$U := U \setminus \{v_i\}$. We apply the rule exhaustively till there are no more such vertices $v_i$. If the
resultant solution set $D$ has more than $k$ vertices, then there exists no solution and we return NO.

After application of the rule, all vertices $v_i \in U$ have $d(v_i) \le k + \lambda/2$. Therefore, the number
of edges in the remaining induced subgraph $G[V \setminus D]$ is at most $(k - |D|)(k + \lambda/2) + \lambda/2$ if it
is to have a solution. The reason for this is that $D$ can accommodate $k - |D|$ more vertices, each of
which has degree at most $k + \lambda/2$. Hence, moving any vertex from $V \setminus D$ to $D$ can cover at most
$k + \lambda/2$ edges. In addition, $\lambda/2$ edges are allowable in $G[V \setminus D]$. If the number of edges is
more than this, we return NO. This ends our kernelization step.
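
The kernelization step just described can be sketched as follows. This is an illustration rather
than the authors' code: the function name `kernelize` is hypothetical, and it assumes that $d(v_i)$
always refers to the degree in the input graph, so that a single pass over the vertices applies the
rule exhaustively.

```python
def kernelize(adj, k, lam):
    """Apply the high-degree rule and the edge-count check.

    Returns (forced, rest), where `forced` must belong to D and `rest` is the
    remaining vertex set U, or None when the answer is already NO.
    """
    threshold = k + lam / 2
    forced = {v for v in adj if len(adj[v]) > threshold}    # Lemma 1
    if len(forced) > k:
        return None
    rest = set(adj) - forced
    edges_left = sum(len(adj[v] & rest) for v in rest) // 2
    # (k - |D|) further vertices cover at most `threshold` edges each,
    # and at most lam / 2 edges may stay uncovered.
    if edges_left > (k - len(forced)) * threshold + lam / 2:
        return None
    return forced, rest

# Toy inputs: a star with six leaves, two such stars, and ten disjoint edges.
leaves = [f"l{i}" for i in range(6)]
star = {"c": set(leaves), **{leaf: {"c"} for leaf in leaves}}
two_stars = {**star, "c2": {f"m{i}" for i in range(6)},
             **{f"m{i}": {"c2"} for i in range(6)}}
matching = {}
for i in range(10):
    matching[f"x{i}"] = {f"y{i}"}
    matching[f"y{i}"] = {f"x{i}"}
```

For the star with $k = 1$ and $\lambda = 2$ the centre is forced into $D$; the two-star instance is
rejected because two vertices are forced, and the matching is rejected by the edge-count check.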

Next, we perform a Depth-bounded Search on the remaining graph. We set our new
parameter to $k' = k - |D|$. At every step of the search we maintain two partitions of $U$:

• $X$: for every $v_i \in X$, $|N(v_i) \cap D| \ge d(v_i) - s(v_i)$,

• $Y$: for every $v_i \in Y$, $|N(v_i) \cap D| < d(v_i) - s(v_i)$.

The partition $X$ consists of all vertices $v_i$ that have their required (at least $d(v_i) - s(v_i)$)
neighbors in $D$, while the partition $Y$ has all other vertices of $U$. We refer to the vertices in $X$
as saturated and those in $Y$ as unsaturated.

We pick an edge $(u, v)$ from the remaining graph and branch upon the following conditions:
either $(u, v)$ is in $G[V \setminus D]$ or it is not. If not, then either $u$ or $v$ is in $D$. So, we recursively
solve the problem by performing a three-way branching at each step. More specifically, if we
move a vertex $u$ (or $v$) into $D$, we set $D := D \cup \{u\}$ (or $D := D \cup \{v\}$) and $U := U \setminus \{u\}$ (or
$U := U \setminus \{v\}$). If, as a result, there are vertices $w \in Y$ such that $|N(w) \cap D| \ge d(w) - s(w)$,
then we move $w$ from $Y$ to $X$ by setting $X := X \cup \{w\}$ and $Y := Y \setminus \{w\}$ (that is, $w$ is now
saturated). We reduce the parameter $k'$ by 1. However, if the edge $(u, v)$ is retained in $G[V \setminus D]$,
we reduce the parameter $\lambda/2$ by 1. At any particular step, if $k' = 0$ and $Y \ne \emptyset$, we return NO.
Else, we return YES and the solution set $D$.

The correctness of the algorithm is clear from the description. For the time complexity,
observe that at each recursive step we perform a three-way branching and the depth of the
recursion tree is at most $d = k + \lambda/2$. Hence, the total number of nodes in the tree is bounded
by $3^{k + \lambda/2}$. Since we spend polynomial time for kernelization and at each step of the search
tree, the complexity of our algorithm is bounded in the worst case by $O(3^{k + \lambda/2}\, n^{O(1)})$, which
is FPT in $k$ and $\lambda$. □

Pruning the search tree: This is based on the observation that when we pick an edge $(u, v)$
on any recursive call, if $s(u) = 0$ or $s(v) = 0$ or $\lambda = 0$ then we can avoid the third branch
altogether. Hence, the complexity can be reduced to $O(c^{k + \lambda/2}\, n^{O(1)})$, where $2 \le c \le 3$.

We now derive a relationship between $(k, \lambda)$-lds and core through the following lemma.
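
The three-way branching can be sketched compactly. The code below is a simplified illustration and
not the authors' implementation: it omits kernelization, replaces the $X$/$Y$ bookkeeping by
per-vertex counters of retained edges, and enforces the $\lambda$ budget the way the search above
does, as a limit of $\lfloor \lambda/2 \rfloor$ on the number of edges retained outside $D$. The name
`solve_k_lambda_lds` is hypothetical, and the input is assumed to be a simple graph.

```python
def solve_k_lambda_lds(adj, slack, k, lam):
    """Bounded search tree: return a set D, or None if the search finds none."""
    edges = list({frozenset((u, v)) for u in adj for v in adj[u]})

    def search(D, retained, used, k_left, e_left):
        # Pick an edge that is neither covered by D nor already retained.
        pending = next((e for e in edges
                        if e not in retained and not (e & D)), None)
        if pending is None:
            return set(D)                    # every edge is covered or retained
        u, v = tuple(pending)
        if k_left > 0:                       # branches 1 and 2: u or v joins D
            for x in (u, v):
                found = search(D | {x}, retained, used, k_left - 1, e_left)
                if found is not None:
                    return found
        # Branch 3: keep (u, v) outside D. It is skipped when either slack is
        # exhausted or no edge budget is left (the pruning rule).
        if (e_left > 0 and used.get(u, 0) < slack[u]
                and used.get(v, 0) < slack[v]):
            bumped = {**used, u: used.get(u, 0) + 1, v: used.get(v, 0) + 1}
            return search(D, retained | {pending}, bumped, k_left, e_left - 1)
        return None

    return search(frozenset(), frozenset(), {}, k, lam // 2)

# Toy graph: the path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
no_slack = {v: 0 for v in path}
some_slack = {"a": 0, "b": 0, "c": 1, "d": 1}
```

Every recursive call lowers either `k_left` or `e_left`, so the recursion depth is at most
$k + \lambda/2$, in line with the bound used in the analysis above.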


Lemma 2 For a graph $G = (V, E)$, let $S = \{s(v_1), \ldots, s(v_n)\}$ be an integral vector, and let $k$
and $\lambda$ be positive parameters. If $G$ has a $(k, \lambda)$-list dominating set $D$ of size at most $k$ such
that $\sum_{v_i \in V \setminus D} s(v_i) \le \lambda$, then the subgraph $\bar{G}[V \setminus D]$ of the complement graph
$\bar{G}$ of $G$ is a core $C$ of size at least $(n - k)$ with density
$\left\{\binom{n-k}{2} - \frac{\lambda}{2}\right\} / \binom{n-k}{2} \le \gamma \le 1$.


Proof: If $D \subseteq V$ is the desired $(k, \lambda)$-lds of $G$ such that $|D| \le k$ and
$\sum_{v_i \in V \setminus D} s(v_i) \le \lambda$, then we can infer two points: the size of $V \setminus D$ is at least
$(n - k)$ and every vertex $v_i \in V \setminus D$ has at most $\lambda/2$ neighbors in $V \setminus D$. In other words,
the subgraph induced on $V \setminus D$ has at least $(n - k)$ vertices with at most $\lambda/2$ edges between
the vertices. Hence, the complement of the subgraph will also be of size at least $(n - k)$, but will
have at most $\lambda/2$ edges missing. So, the total number of edges in the complement subgraph would
be at least $\binom{n-k}{2} - \frac{\lambda}{2}$. Hence, we get a core with density $\gamma$:
$\left\{\binom{n-k}{2} - \frac{\lambda}{2}\right\} / \binom{n-k}{2} \le \gamma \le 1$. □

This results in the following corollary for a core in the original graph:

Corollary 1 For a graph $G = (V, E)$, let $S = \{s(v_1), \ldots, s(v_n)\}$ be an integral vector, and let $k$
and $\lambda$ be positive input parameters. If the complement graph $\bar{G}$ has a $(k, \lambda)$-lds $D$ such that
$|D| \le k$ and $\sum_{v_i \in V \setminus D} s(v_i) \le \lambda$, then the subgraph $G[V \setminus D]$ of the original graph
$G$ is a core $C$ of size at least $(n - k)$ with density
$\left\{\binom{n-k}{2} - \frac{\lambda}{2}\right\} / \binom{n-k}{2} \le \gamma \le 1$.


6.2 Solution to cluster editing 

We give a search tree-based FPT algorithm for this problem. We search for conflict triples
in the graph. A conflict triple $\{u, v, w\}$ has vertices $u$ and $v$ connected by an edge and $u$
connected to a vertex $w$ that is not connected to $v$. Such a conflict triple compels us to do one
of four things: (a) add the edge $(v, w)$, (b) remove the edge $(u, v)$, (c) remove the edge $(u, w)$, or
(d) allow the triple to remain as it is. Each of the first three options requires an edit operation
and hence counts towards the parameter $k$. The last option counts towards the parameter
$\lambda$. This is because, by allowing the conflict triple to remain as it is, we are allowing the cluster
containing this triple to have a missing edge. By branching on each of these cases, we can arrive
at an FPT algorithm.

One way to improve the algorithm would be to maintain the markers permanent and forbidden.
When an edge $(u, v)$ is added, it is marked permanent so that subsequent calls do not remove
it. Similarly, when an edge $(u, v)$ is removed, it is marked forbidden so that subsequent calls do
not add it. With these markers, we can arrive at better worst-case bounds for the search tree.
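
The basic building block of this search, locating a conflict triple, can be sketched as below. The
function name is hypothetical, vertex labels are assumed to be sortable, and the sketch covers only
the detection step, not the four-way branching itself. It relies on the standard fact that a graph
is a disjoint union of cliques exactly when it contains no conflict triple.

```python
def find_conflict_triple(adj):
    """Return (u, v, w) with edges (u, v) and (u, w) but no edge (v, w), or None."""
    for u in adj:
        nbrs = sorted(adj[u])
        for i, v in enumerate(nbrs):
            for w in nbrs[i + 1:]:
                if w not in adj[v]:        # v and w share neighbour u but are not linked
                    return (u, v, w)
    return None

# Toy graphs: a path on three vertices, and two disjoint triangles.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
two_triangles = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
                 "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
```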

6.3 Solution to quasi isomorphism

We state the following lemma and explain it.

Lemma 3 Let $H$ be the product graph of two given graphs $G_1$ and $G_2$. A core $C$ in $H$ with at
most $\lambda/2$ edges missing corresponds to a local quasi isomorphism between $G_1$ and $G_2$ with at most
$\lambda/2$ mismatches.

Let $C$ be the core of $H$ obtained by the algorithm in Theorem 1. If $C$ has $\lambda/2$ edges missing
then these edges correspond to $\lambda/2$ mismatches between two local regions of $G_1$ and $G_2$. That is,
if an edge between vertices labeled $\{u_1, v_1\}$, $\{u_2, v_2\}$ of $H$ is missing in $C$, then either (a) edge
$(u_1, v_1) \in E_1$ (of $G_1$) and edge $(u_2, v_2) \notin E_2$ (of $G_2$), or (b) edge $(u_1, v_1) \notin E_1$ (of $G_1$) and
edge $(u_2, v_2) \in E_2$ (of $G_2$). The number of such mismatches is bounded by $\lambda/2$.

7 Experiments and results 

Experimental setup: Here, we only show the experimental results of modeling hubs and
functional modules using the $(k, \lambda)$-lds problem. We implemented the algorithm of Theorem
1 on an Intel Xeon 2.4 GHz machine with 4 GB RAM running Debian. The protein interaction dataset
is that of Saccharomyces cerevisiae (budding yeast) from the Biogrid database [1]. After cleaning
and removing self-loops and multi-edges, the resultant network had 1562 proteins and 1408 edges.
The highest degree in the network was 46. About 5% of the proteins had degrees in the range 20-35,
while the rest were below 3. We left the isolated proteins to be filtered away by the algorithm
instead of during cleaning.

Setting the parameters: $k$ is the parameter that determines the size of the solution $D$.
Through several experimental runs we have noticed that the algorithm works very efficiently
for scale-free networks and typically gives a solution for small (relative to $n$) values of $k$. This
is mainly because of the presence of a few high-degree nodes which, when picked, cover most of
the interactions. The complements of sparse scale-free networks turn out to be dense, and a
small $k$ typically does not give a solution. If $k$ is very small, too many nodes will get included
in the solution during kernelization, thereby overshooting the solution size. If $k$ is too large,
the problem may not kernelize well. But a large $k$ does not necessarily make the problem
intractable in practice because the theoretical bound is essentially a worst-case bound. $\lambda$ is set
much smaller than the total number of edges. We chose to keep the slack values of the
vertices proportional to their degrees by fixing a division factor $r$ while setting the vector $S$.
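
A degree-proportional slack vector of this kind could be produced as follows. This is a sketch under
one assumption that the text does not settle: the quotient $d(v)/r$ is rounded down so that the
slacks stay integral. The function name `assign_slack` is hypothetical.

```python
def assign_slack(adj, r):
    """Slack proportional to degree: s(v) = floor(d(v) / r) (rounding is assumed)."""
    if r < 1:
        raise ValueError("division factor r must be at least 1")
    return {v: len(adj[v]) // r for v in adj}

# Toy star: the hub has degree 6, each leaf has degree 1.
leaves = [f"l{i}" for i in range(6)]
star = {"hub": set(leaves), **{leaf: {"hub"} for leaf in leaves}}
slack = assign_slack(star, 3)
```

With $r = 3$ the hub receives slack 2 while every leaf receives slack 0, which matches the
observation that most low-degree vertices end up with very small slack values.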

Detection of hubs and participating proteins: Some notations used to represent the
results are: $m'$ is the number of edges in the complemented graph $\bar{G}$, $r$ is the division factor
and the values of the vector $S$ are set as $s(v) = d(v)/r$ for all $v \in V$, $\gamma$ is the density of the
core in the original graph $G$ and $T$ is the time in seconds.

Table 1 gives the results of detecting hub proteins on four runs with different parameter
settings. The slack values for most vertices are usually very small because of the division factor
$r$. The higher these slack values, the higher the number of slack edges allowable and hence the
smaller the size of the solution set $D$. Secondly, in practice the algorithm runs very fast even
though the theoretical worst-case bound is exponential. Table 2 gives brief biological descriptions
of a few hubs that were discovered. The myosin pair myo3-myo5 was present in our dataset and
the algorithm successfully found it. Their complete biological descriptions can be obtained
by searching for the protein name on the Biogrid website [1]. The full list of hubs discovered
can be obtained from [22]. Table 3 gives the results of four runs on the complemented protein
network. The points to notice here are that the complements of protein networks are usually
dense. The algorithm runs slower on these complemented networks and hence discovering a
core takes longer than discovering hubs. Also, the smaller the slack values in the complemented
graph, the higher the density $\gamma$ of the core in the original graph. This is because fewer slack edges
are allowable among the vertices of the set $C$ in the complemented graph and hence fewer edges are
missing in the core of the original graph.

Table 1: Hubs: four runs on yeast protein network


Hub protein   Description                                           GO annotations
MYO5          deletion with MYO3 affects growth, cytoskeleton org   myosin binding; exocytosis
MYO3          deletion with MYO5 affects growth, cytoskeleton org   microfilament motor activation
RVS167        actin-associated; endocytosis                         cytoskeleton protein binding; osmotic stress
VRP1          cytokinesis; mammalian WASP syndrome                  actin binding; cytoskeleton org


Table 2: Descriptions of some hubs verified through the Biogrid [1]


$n$     $m$     $m'$      $k$     $\lambda$   $r$   $|D|$   $|C|$   $\gamma$   $T$
1562    1408    1217733   1350    2000        3     1047    515     0.0267     10
1562    1408    1217733   1400    2000        4     1178    384     0.0484     11
1562    1408    1217733   1350    2000        5     1256    306     0.0833     11
1562    1408    1217733   1350    2000        6     1303    259     0.1091


Table 3: Participating proteins; four runs on the complemented yeast protein network


Detection of functional modules: Some of the functional modules detected are shown
in Table 4. Even though the density of the core $C$ of participating proteins is small, the densities
of the individual functional modules are very high. So, Steps 1 and 2 have indeed helped in
gathering all the proteins more likely to participate in some functional module(s) and made it
easier to filter away non-participating proteins. Gene Ontology based verification through [23]
also shows a high percentage of common Process, Function and Component ontologies shared
by the proteins involved in these modules. In this verification we submit the set of proteins
belonging to a module to the GO browser and gather statistics about the shared properties.
This also shows that several common functions are shared across modules.

ID    Size   Density   % shared   p-value   Description
FM1   10     0.667     96%        1.27e-5   Post-trans mod; cell division; phosphorylation
FM2   18     0.561     90%        9.75e-9   Protein-kinase activity; receptor signaling; transferase activity
FM3   22     0.545     76%        8.32e-8   Protein-kinase activity; MAP kinase; microfilament motor
FM4   18     0.543     80%        9.75e-9   Protein-kinase activity; phosphotransferase activity; signal transducer


Table 4: Functional modules verified through Yeast GO [23]


Discussions: The detection of individual functional modules in Step 3 is just one possible
approach. The information obtained through hubs and participating proteins can be used to
detect functional modules and complexes using other algorithms as well. For instance, the
algorithm of [?] uses a heuristic search for finding cliques and ‘near cliques’ in a protein
network. This algorithm can be run on the subgraph formed by the participating proteins
instead of the whole protein network, considering a hub as the start node. Similarly, certain
clustering and random walk based methods given in [?], which use random nodes as the initial
start points, can also make use of our methods for finding the start nodes. Therefore, the hub
and participating protein based information can be used in conjunction with, or as filtering steps
for, other methods to arrive at results that are both more efficient and more accurate.

From a technical point of view too we get a lot of insights into the problem from these
experiments. We ran our algorithm on some standard, scale-free and random graphs. We
noticed that $n$ and $|D|$ are almost linearly related. Also, $D$ is largest when the slacks are zero,
that is, the vertex cover of the graph is the largest solution set for the $(k, \lambda)$-lds problem. As the
slacks are increased, the size $|D|$ (and hence the parameter $k$) as well as the time $T$ to arrive
at a feasible solution are reduced.


8 Conclusions and future work 

We discussed parameterized modeling and solutions to several practical problems in proteomics.
Through this we showed how parameterized approaches can be viable alternatives to previously
known sub-optimal techniques. We are in the process of biological and statistical analysis of the
experimental results, for instance checking (a) how many actual lethal proteins were missed by the
algorithm even though they were present in the dataset, (b) how many non-hubs were falsely detected
as hubs, and (c) whether any new hub not present in public databases was found in our dataset. We
are also in the process of experimenting on the cluster editing and quasi isomorphism problems on
real data sets. There is a lot of scope for related future research as well. Firstly, one can look
at more efficient parameterized algorithms for the same problems, especially arriving at a linear
kernel for the $(k, \lambda)$-lds problem. Secondly, one can look at methods for graph partitioning and
applying our algorithms to each partition, thereby increasing the scalability of the approach.
Thirdly, the current work can motivate researchers to look at parameterized approaches to other
problems in this field, for instance, protein function and structure prediction.